In this paper, we examine whether change-point analysis techniques are appropriate for detecting fatigue from data captured by wearable sensors. To that end, we perform a secondary analysis of the data generated in: Baghdadi et al., 2018. The reader should note that their raw data was preprocessed using:

  1. Kalman Filter: Used to process the raw sensor data to: (i) estimate the spatial orientation of the body with respect to the global reference frame; and (ii) estimate the kinematics of motion.

  2. Segmentation: The motion data were then segmented into strides using an algorithm that assumes the existence of two peaks in the translational acceleration of the gait cycle. This assumption was justified based on the results of Tongen and Wunderlich, 2010, as well as the authors’ preliminary analyses.
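Step 1 can be illustrated with a deliberately simplified sketch. The filter used by Baghdadi et al. is multivariate (it jointly estimates orientation and kinematics); the one-dimensional, random-walk version below, with made-up noise settings `q` and `r`, is only meant to show the predict/update mechanics:

```r
# Minimal 1D Kalman filter sketch (illustration only; the original filter is
# multivariate). The state model is a random walk; q and r are assumed values.
kalman_1d <- function(z, q = 1e-3, r = 0.5, x0 = z[1], p0 = 1) {
  x_hat <- numeric(length(z))
  x <- x0; p <- p0
  for (k in seq_along(z)) {
    p <- p + q                 # predict: state variance grows by process noise
    K <- p / (p + r)           # Kalman gain
    x <- x + K * (z[k] - x)    # update with measurement z[k]
    p <- (1 - K) * p
    x_hat[k] <- x
  }
  x_hat
}

set.seed(1)
z <- 5 + rnorm(200, sd = 0.5)  # noisy measurements of a constant signal
smoothed <- kalman_1d(z)
```

Because the assumed process noise `q` is small relative to the measurement noise `r`, the filter trusts its state estimate and smooths heavily.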

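The two-peak assumption in Step 2 can likewise be sketched. Below, two Gaussian bumps stand in for the translational acceleration of one gait cycle, and we locate the two largest local maxima; this is a hypothetical illustration, not the authors' segmentation algorithm:

```r
# Hedged sketch: locate the two acceleration peaks within one gait cycle.
# `accel` is a synthetic stand-in for translational-acceleration samples.
find_two_peaks <- function(accel) {
  n <- length(accel)
  # a local maximum is greater than both of its neighbours
  is_peak <- c(FALSE,
               accel[2:(n - 1)] > accel[1:(n - 2)] &
               accel[2:(n - 1)] > accel[3:n],
               FALSE)
  peaks <- which(is_peak)
  # keep the two largest local maxima, returned in time order
  sort(peaks[order(accel[peaks], decreasing = TRUE)][1:2])
}

t <- seq(0, 1, length.out = 200)                      # one normalized cycle
accel <- dnorm(t, 0.25, 0.05) + 0.8 * dnorm(t, 0.70, 0.05)
find_two_peaks(accel)
```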
The reader can show any code chunk by clicking on the code button. We chose to hide the code by default since we: (a) wanted to improve the readability of this document; and (b) assumed that most readers will not be interested in reading every code chunk.


1 Loading Data & Generating Features

The snippet below documents the R libraries used in this research. For convenience, we used the pacman package since it installs and loads the needed libraries in one step.

rm(list = ls()) # clear global environment
graphics.off() # close all graphics
library(pacman) # needs to be installed first
p_load(R.matlab, plotly, extrafont, grDevices, gridExtra,
       dplyr, stringr, tidyverse, utils, reshape2,
       anomalize, forecast, MVN, fractal)

In the snippet below, we extract the 15 “.mat” files from the GitHub repository (where we uploaded the data to allow for the reproduction of our work). Note that these files were originally produced in Baghdadi et al., 2018. We then perform several transformation steps: (a) extracting the data for the first three columns in the MATLAB arrays; and (b) computing three kinematic features from the data corresponding to these columns. Due to the differences between MATLAB and R, this requires two nested for loops. The outer loop iterates over the subjects, while the inner loop iterates over the subject-specific number of rows of data. Please see the comments within the code chunk for more details.

num_subjects <- seq(1, 15)
subject_heights <- c(1.71, 1.77, 1.71, 1.59, 1.69,
                     1.63, 1.60, 1.71, 1.67, 1.78,
                     1.68, 1.55, 1.83, 1.81, 1.89)

# Initializing a df for summary data on participants
summary_df <- data.frame(matrix(nrow = 15, ncol = 9))
colnames(summary_df) <- c("Subject.Num", "num.rows",
                          "num.cols", "mean.scaled.stride.len",
                          "sd.scaled.stride.len",
                          "mean.scaled.stride.height",
                          "sd.scaled.stride.height",
                          "mean.stride.duration",
                          "sd.stride.duration")

for (i in 1:length(num_subjects)) {
  # Reading the .mat files from GitHub
  raw_data <- readMat(paste0("https://github.com/fmegahed/fatigue-changepoint/blob/master/Data/Raw/Subject",num_subjects[i],".mat?raw=true"))
  # Compute the number of cells, and rows in each structured matrix
  raw_data_size <- lengths(raw_data) # num of cells
  num_rows <- raw_data_size / 17 # all data had 17 cols
  # Initializing the seven lists needed for storing the data (we keep the first three raw channels for error checking)
  time_in_sec <- vector("list", length = num_rows)
  position_x <- vector("list", length = num_rows)
  position_y <- vector("list", length = num_rows)
  stride_time <- vector("list", length = num_rows)
  stride_length <- vector("list", length = num_rows)
  stride_height <- vector("list", length = num_rows)
  stride_duration <- vector("list", length = num_rows)
  
  # The following for loop is needed since R reads the structured array as a nested list. The list containing the data is called "M.i.k", and the original array is flattened such that our first three channels (which share a timestamp) are always separated by a distance equal to the total number of rows
  for (j in 1:num_rows) {
    position_x[[j]] <- raw_data[["M.i.k"]][[j]]
    position_y[[j]] <- raw_data[["M.i.k"]][[num_rows + j]]
    stride_time[[j]] <- raw_data[["M.i.k"]][[2 * num_rows + j]]
    dataholder <- raw_data[["M.i.k"]][[16 * num_rows + j]] # data holder for time
    # Computing the three needed kinematic features
    stride_length[[j]] <-
      range(position_x[[j]])[2] - range(position_x[[j]])[1]
    stride_height[[j]] <-
      range(position_y[[j]])[2] - range(position_y[[j]])[1]
    stride_duration[[j]] <-
      range(stride_time[[j]])[2] - range(stride_time[[j]])[1]
    time_in_sec[[j]] <- lapply(dataholder, mean) # using the mean time of the stride as its timestamp
  }
  
  # Scaling and creating one data frame per subject
  assign(paste0("subject", i, "_features"), 
         data.frame(time.from.start = unlist(time_in_sec), 
                    scaled.stride.len = unlist(stride_length)/subject_heights[i], 
                    scaled.stride.height = unlist(stride_height) / subject_heights[i], 
                    stride.duration = unlist(stride_duration)
                    )
         )
  
  # Creating a Summary Data Frame
  df_name <- paste0("subject", i, "_features")
  summary_df[i, 1] <- paste0("subject", i)
  summary_df[i, 2] <- get(df_name) %>% nrow()
  summary_df[i, 3] <- get(df_name) %>% ncol()
  summary_df[i, 4] <- get(df_name)[, 1] %>% mean() %>% round(digits = 4)
  summary_df[i, 5] <- get(df_name)[, 1] %>% sd() %>% round(digits = 4)
  summary_df[i, 6] <- get(df_name)[, 2] %>% mean() %>% round(digits = 4)
  summary_df[i, 7] <- get(df_name)[, 2] %>% sd() %>% round(digits = 4)
  summary_df[i, 8] <- get(df_name)[, 3] %>% mean() %>% round(digits = 4)
  summary_df[i, 9] <- get(df_name)[, 3] %>% sd() %>% round(digits = 4)
}
# Printing the top six rows of Subject 4's data as an example
head(subject4_features) %>% round(digits = 3)
# A Summary of the features for all 15 participants
summary_df
rm(raw_data, raw_data_size, i, j, num_rows, 
   dataholder, num_subjects, subject_heights)
save.image(file = "./Data/RGenerated/FeatureGeneration.RData")

Based on the analysis above, three observations can be made. First, we scaled the stride length and height by each subject’s height, which in essence captures the steps as a percentage of the person’s height. This reduces the between-subject variability in the data and is supported by the seminal work of Oberg et al., 1993. Second, we showed the first six rows of Subject 4’s data to provide the reader with some insight into the sampling frequency (after the preprocessing done in Baghdadi et al., 2018). Note that the kinematic features are computed from the sensor channels provided in their paper. Third, we saved the generated data into an .RData file, which can be accessed by clicking on: FeatureGeneration.RData. We hope that this saved data allows other researchers to reproduce and/or build on our work.


2 Detecting and Removing Outliers

In this task, we examined several approaches for outlier detection. We originally hypothesized that these outliers constitute faulty sensor readings (since they are not sustained), and thus can be removed prior to any further analysis. However, based on a detailed investigation of the data and the application of several outlier detection methods (univariate and multivariate), we learned that the data are non-normal, highly autocorrelated and non-stationary. Consequently, these approaches are not suitable for removing the outliers from our data, and we instead investigated “filtering”-based methods in Section 3 to impute/correct/smooth the data.
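The kind of diagnostics behind this conclusion can be sketched as follows, run here on a synthetic autocorrelated series rather than our actual features. Note that for this synthetic series only the serial-correlation checks are expected to flag a problem, since its marginal distribution is still Gaussian:

```r
# Stand-in series: an AR(1) process mimics strong autocorrelation
set.seed(42)
x <- as.numeric(arima.sim(model = list(ar = 0.9), n = 500))

shapiro.test(x)                            # Shapiro-Wilk normality test
Box.test(x, lag = 20, type = "Ljung-Box")  # test for serial correlation
acf(x, plot = FALSE)                       # autocorrelation structure
```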

For the sake of reproducibility and completeness, we provide the results from our implementation of several outlier detection methods in the subsections below. Note that these methods are not incorporated in our final analysis; they are presented here only to document how we reached our final model/approach.

2.1 Plotting the Data

First, we use a standard line plot to depict the data for each feature by person. To facilitate visualization, we: (a) panel the plot such that the top, middle and bottom panels correspond to the scaled stride length, scaled stride height and stride duration, respectively; and (b) save the results for each participant (P) in a different tab.

# We are reading the data locally (you can download the file from GitHub)
load(file = "./Data/RGenerated/FeatureGeneration.RData")
for (i in 1:15) {
  df_transformed <- melt(get(paste0("subject",i,"_features")),
                         id.vars = "time.from.start",
                         measure.vars=c("scaled.stride.len",
                                        "scaled.stride.height",
                                        "stride.duration"
                                        )
                         )
  assign(paste0("p",i),
         ggplot(data = df_transformed,
                aes(x=time.from.start,y=value,group=variable,color=variable,
                    shape=variable)) + 
           geom_line() + theme_bw() + 
           #ylim(0,2) +
           ggtitle(paste0("Participant ",i,": Pre-filtered Data")) + 
           theme(legend.position="none") + 
           facet_wrap(~variable,nrow=3, scales = "free_y")
         )
  cat('###',paste0("P", i), "{-}",'\n')
  print(get(paste0("p",i)))
  cat('\n \n')
  
  cat('<p> Based on the plots above, we observed the presence of several outliers for each participant. For the <i>scaled stride length</i>, it is unlikely that the stride length exceeds 1 for any of our participants since this would mean that their stride is greater than their height. For example, Participant 2 has one value that is greater than 1.75. This value is infeasible and reflects an error in the segmentation and/or the sensor signal. Similar observations can be made for the <i>scaled stride height</i>, where we expect most of the observations to be less than or equal to 0.2. For example, P10 has an observation of about 0.5 and 5-6 observations that are greater than 0.2. These sensor/segmentation-based errors will have to be <b>removed</b>. We do not correct them since: (a) using a <i>mean imputation (or any other form of imputation)</i> would affect our changepoint modeling results; and (b) removing them has a negligible effect on both the sample size and, more importantly, the interpretation of the sensor signal (since the data points are typically within 2 seconds). To remove them, we can use statistical outlier detection approaches (see <a href = "https://link.springer.com/article/10.1023/B:AIRE.0000045502.10941.a9"> Hodge and Austin, 2004 </a>). Based on our dataset and plots, these errors will also be captured by a statistical approach. </p>')
  
cat(' \n \n')
}

P1

Based on the plots above, we observed the presence of several outliers for each participant. For the scaled stride length, it is unlikely that the stride length exceeds 1 for any of our participants since this would mean that their stride is greater than their height. For example, Participant 2 has one value that is greater than 1.75. This value is infeasible and reflects an error in the segmentation and/or the sensor signal. Similar observations can be made for the scaled stride height, where we expect most of the observations to be less than or equal to 0.2. For example, P10 has an observation of about 0.5 and 5-6 observations that are greater than 0.2. These sensor/segmentation-based errors will have to be removed. We do not correct them since: (a) using a mean imputation (or any other form of imputation) would affect our changepoint modeling results; and (b) removing them has a negligible effect on both the sample size and, more importantly, the interpretation of the sensor signal (since the data points are typically within 2 seconds). To remove them, we can use statistical outlier detection approaches (see Hodge and Austin, 2004). Based on our dataset and plots, these errors will also be captured by a statistical approach.
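The rule-of-thumb feasibility checks above can be expressed as a simple filter. The thresholds (scaled length ≤ 1, scaled height ≤ 0.2) come from the discussion above, and the `feasible_only` helper and toy data frame below are made up for illustration; as noted, a statistical approach captures these same errors without hand-picked cutoffs:

```r
# Hypothetical rule-based filter using the feasibility thresholds above
feasible_only <- function(df, max_len = 1, max_height = 0.2) {
  df[df$scaled.stride.len <= max_len &
       df$scaled.stride.height <= max_height, ]
}

toy <- data.frame(scaled.stride.len    = c(0.40, 1.80, 0.45),
                  scaled.stride.height = c(0.10, 0.12, 0.50))
feasible_only(toy)  # only the first row survives both checks
```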


2.2 Implementation of Several Outlier Detection Methods

2.2.1 Boxplots for Outlier Detection

One of the most commonly used approaches for detecting univariate outliers flags any observation that lies more than \(1.5 \times IQR\) beyond the quartiles, where the IQR is the interquartile range. This can be easily implemented in base R using the boxplot.stats function. Below, we first provide the output and data plots for each person.
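As a quick illustration of the mechanics (on a toy vector, not our data), boxplot.stats returns the values outside the 1.5*IQR fences in its $out component:

```r
x <- c(1:10, 100)     # one obvious outlier
boxplot.stats(x)$out  # returns 100
```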

# We are reading the data locally (you can download the file from GitHub)
load(file = "./Data/RGenerated/FeatureGeneration.RData")
outliers_bp <- list() # initializing a list to store outliers per participant
for (i in 1:15) {
  df <- get(paste0("subject",i,"_features")) # Getting the data from Section 1
  # Obtaining the Outliers for each of the three variables
  out_vals_sl<- boxplot.stats(df[,2])$out
  out_rows_sl <- which(df[,2] %in% out_vals_sl)
  out_vals_sh<- boxplot.stats(df[,3])$out
  out_rows_sh <- which(df[,3] %in% out_vals_sh)
  out_vals_sd<- boxplot.stats(df[,4])$out
  out_rows_sd <- which(df[,4] %in% out_vals_sd)
  # Generating a union set of all obs. that have outliers
  # True: if any of the 3 vars for that obs. is an outlier
  outliers_total <- unique(c(out_rows_sl, out_rows_sh,
                             out_rows_sd))
  outliers_bp[[i]] <- outliers_total # saving it to list indexed by participant number
  
  # Remove the observations corresponding to the outliers
  assign(paste0("subject",i,"_bp"), 
         df[-outliers_total,])  
  
  # Preparing the data for the Line Graph
  df_transformed <- melt(get(paste0("subject",i,"_bp")),
                         id.vars = "time.from.start",
                         measure.vars=c("scaled.stride.len",
                                        "scaled.stride.height",
                                        "stride.duration"
                                        )
                         ) # ggplot data needs to be tall
  
  assign(paste0("g",i),
         ggplot(data = df_transformed,
                aes(x=time.from.start, y=value, group=variable,
                    color=variable,shape=variable)) + 
           geom_line() + theme_bw() + 
           #ylim(0,2) +
           ggtitle("Outliers removed via the standard boxplot method \n (any point outside of 1.5*IQR was removed)") + 
           theme(legend.position="none",
                 #axis.text.x=element_text(angle=90,hjust=1),
                 plot.title = element_text(hjust = 0.5)) +
           facet_wrap(~variable,nrow=3, scales = 'free_y')
         )
  cat('####',paste0("P", i), "{-}",'\n')
  
    
    # Printing the % of outliers removed
    cat("\n")
    num_rows <- nrow(get(paste0("subject",i,"_features")))
    num_outliers <- length(outliers_total)
    percent_data <- round(100*num_outliers/num_rows,2)
    cat("<b>",paste0("% of removed Observations for Participant ",i,
                 ": ", percent_data,"%."), "</b>",
        "This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.")
    cat('\n \n')
  # Plotting the data without the outliers
  print(get(paste0("g",i)))
  cat('\n \n')
  
  cat('<p> Based on the 15 charts above, one can see that using the standard boxplot approach for removing the outliers has resulted in a smoother time-series for all participants. However, the percentage of observations removed by this approach is high (i.e., >= 10%) for participants 6-7, 10, 11-13 and 15. Note that for these participants, the percentage was always under 15%, with the exception of participant 11, who had 23.66% of his/her observations removed. <b>This obviously represents an unacceptable amount of data lost, and as such we will examine additional methods for removing/detecting outliers.</b> \n \n </p>') 

cat('<p>
We could have tried making the coefficient multiplied by the IQR larger (e.g., both the <i>tsoutlier() function</i> in the <a href="https://cran.r-project.org/web/packages/forecast/forecast.pdf">forecast package</a> and the <i>iqr() function</i> in the <a href="https://cran.r-project.org/web/packages/anomalize/anomalize.pdf">anomalize package</a> use a value of 3 for the coefficient); however, we did not explore this option since: (a) it would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify. </p>')

cat(' \n \n')
}

P1

% of removed Observations for Participant 1: 7.51%. This removes any value outside 1.5 * IQR for any of the 3 features. From the code, it should be clear that we are implementing this columnwise, starting with stride length and ending with stride duration.

Based on the 15 charts above, one can see that using the standard boxplot approach for removing the outliers has resulted in a smoother time-series for all participants. However, the percentage of observations removed by this approach is high (i.e., >= 10%) for participants 6-7, 10, 11-13 and 15. Note that for these participants, the percentage was always under 15%, with the exception of participant 11, who had 23.66% of his/her observations removed. This obviously represents an unacceptable amount of data lost, and as such we will examine additional methods for removing/detecting outliers.

We could have tried making the coefficient multiplied by the IQR larger (e.g., both the tsoutlier() function in the forecast package and the iqr() function in the anomalize package use a value of 3 for the coefficient); however, we did not explore this option since: (a) it would have possibly required a large amount of trial-and-error, and (b) the approach would be difficult to justify.
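For completeness, the wider-fence variant mentioned above can be sketched manually. This is a hypothetical helper on toy data, not the actual tsoutlier()/iqr() implementations:

```r
# Hedged sketch: IQR fences with a configurable coefficient (3 instead of 1.5)
iqr_fences <- function(x, coef = 3) {
  q <- unname(quantile(x, c(0.25, 0.75)))  # type-7 quantiles, unlike the
  iqr <- q[2] - q[1]                       # hinge-based boxplot.stats()
  c(lower = q[1] - coef * iqr, upper = q[2] + coef * iqr)
}

x <- c(1.0, 1.1, 0.9, 1.2, 1.05, 5.0)
iqr_fences(x, coef = 1.5)  # narrower fences flag more points
iqr_fences(x, coef = 3)    # wider fences; the extreme 5.0 still lies outside
```

Even with a coefficient of 3, choosing the "right" coefficient for each participant would involve exactly the trial-and-error we wanted to avoid.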

P2

% of removed Observations for Participant 2: 5.75%.

P3

% of removed Observations for Participant 3: 2.03%.

P4

% of removed Observations for Participant 4: 11.04%.

P5

% of removed Observations for Participant 5: 9.5%.

P6

% of removed Observations for Participant 6: 21.21%.

P7

% of removed Observations for Participant 7: 14.54%.

P8

% of removed Observations for Participant 8: 5.1%.

P9

% of removed Observations for Participant 9: 9.04%.

P10

% of removed Observations for Participant 10: 12.6%.

P11

% of removed Observations for Participant 11: 23.66%.

P12

% of removed Observations for Participant 12: 12.14%.

P13

% of removed Observations for Participant 13: 11.99%.

P14

% of removed Observations for Participant 14: 9.33%.

P15

% of removed Observations for Participant 15: 14.24%.

# Saving the cleaned data frames and the per-participant outlier lists
save(subject1_bp, subject2_bp, subject3_bp, subject4_bp,
     subject5_bp, subject6_bp, subject7_bp, subject8_bp,
     subject9_bp, subject10_bp, subject11_bp, subject12_bp,
     subject13_bp, subject14_bp, subject15_bp, outliers_bp,
     file = "./Data/RGenerated/OutliersRemovedBoxplot.RData")

2.2.2 Mahalanobis Dist. based Method

# We are reading the data locally (you can download the file from GitHub)
load(file = "./Data/RGenerated/FeatureGeneration.RData")
mal_results <- list() # initializing a list to store the Mahalanobis results
for (i in 1:15) {
  df <- get(paste0("subject",i,"_features"))
  df <- df[,2:4] # only taking the three numeric feature columns
  cat('####',paste0("P", i), "{-}",' \n')
  cat('The output from the <i> MVN package </i> is shown below. Note that we depict the multivariate distribution of the three features for each participant below. <b> Essentially, the goal here is two-fold. First, we would like to examine whether the joint distribution of the three kinematic features can be modeled using the multivariate normal distribution. Second, if it can, then we can use the Mahalanobis distance to identify outlying observations. </b>')

  plot.new()
  mvn_test <- mvn(df, mvnTest = "dh",desc = TRUE,
                univariateTest = "Lillie",
                multivariateOutlierMethod = "adj",
                showOutliers = TRUE) #adjusted Mahl Distance 
 cat(' \n')
  
  cat(paste0('In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant ',i,' below. \n \n'))
 mvn_Normality <- mvn_test$multivariateNormality
 rownames(mvn_Normality) <- paste('Participant',i)
 tab <- xtable::xtable(mvn_test$multivariateNormality,
                       align = c("c","c","c","c","c","c"))
 print(tab, type= 'html',
       html.table.attributes = 'align="center", rules="rows", 
                                width=50%, frame="below"')
  cat('<p> \n </p>')
  
  cat(paste0('Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution would be inappropriate. This observation holds not only for Participant ', i, ' but also for all the other participants. Thus, in the following section we will investigate method(s) from the signal processing literature.'))
  
  cat(' \n \n')
  
}

P1

The output from the MVN package is shown below. Note that we depict the multivariate distribution of the three features for each participant below. Essentially, the goal here is two-fold. First, we would like to examine whether the joint distribution of the three kinematic features can be modeled using the multivariate normal distribution. Second, if it can, then we can use the Mahalanobis distance to identify outlying observations. In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 1 below.

Test             E         df    p value   MVN
Doornik-Hansen   4549.14   6.00  0.00      NO

Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution would be inappropriate. This observation holds not only for Participant 1 but also for all the other participants. Thus, in the following section we will investigate method(s) from the signal processing literature.
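Had the normality tests passed, the classical version of this idea could be sketched with base R's mahalanobis() and a chi-square cutoff. This is toy 2-D data, not our actual features, and note that the MVN package's "adj" option uses an adjusted (robust) variant instead, since classical distances can mask outliers:

```r
# Hedged sketch: classical Mahalanobis-distance flagging under an assumed
# multivariate normal model (toy 2-D data, not our actual features)
X <- cbind(len    = c(0.95, 1.00, 1.05, 1.10, 0.90, 1.02, 0.98, 1.07, 0.93, 3.00),
           height = c(0.10, 0.11, 0.09, 0.12, 0.10, 0.11, 0.095, 0.105, 0.115, 0.50))
d2 <- mahalanobis(X, colMeans(X), cov(X))  # squared distances from the centroid
cutoff <- qchisq(0.975, df = ncol(X))      # only meaningful under normality
which(d2 > cutoff)                         # candidate outlier rows
```

Since our features fail the normality tests for every participant, we do not pursue this route further.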

P2

Test             E         df    p value   MVN
Doornik-Hansen   6259.91   6.00  0.00      NO


P3

Test             E         df    p value   MVN
Doornik-Hansen   8137.96   6.00  0.00      NO


P4

Test             E          df    p value   MVN
Doornik-Hansen   11169.45   6.00  0.00      NO


P5

Test             E          df    p value   MVN
Doornik-Hansen   14420.02   6.00  0.00      NO


P6

Test             E          df    p value   MVN
Doornik-Hansen   19339.82   6.00  0.00      NO


P7

Test E df p value MVN
1 Doornik-Hansen 9900.47 6.00 0.00 NO

Based on the MVN test and the plot above, it is clear that the data is not normally distributed. Thus, an outlier detection/removal algorithm based on a MVN distribution will be inappropriate. This observation is true to not only Participant 7 but to also all the other participants. Thus, in the following section we will investigate method(s) based on the signal processing literature detection literature.

P8

The output from the MVN package is shown below, where we depict the joint distribution of the three kinematic features for Participant 8. The goal here is two-fold. First, we examine whether the joint distribution of the three kinematic features can be modeled with a multivariate normal distribution. Second, if it can, we can use the Mahalanobis distance to identify outlying observations. In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 8 below.

Test E df p value MVN
1 Doornik-Hansen 6758.59 6.00 0.00 NO

Based on the MVN test and the plot above, it is clear that the data are not normally distributed. Thus, an outlier detection/removal algorithm based on an MVN distribution would be inappropriate. This observation holds not only for Participant 8 but also for all the other participants. Consequently, in the following section we investigate method(s) from the signal processing literature.

P9

The output from the MVN package is shown below, where we depict the joint distribution of the three kinematic features for Participant 9. The goal here is two-fold. First, we examine whether the joint distribution of the three kinematic features can be modeled with a multivariate normal distribution. Second, if it can, we can use the Mahalanobis distance to identify outlying observations. In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 9 below.

Test E df p value MVN
1 Doornik-Hansen 20458.61 6.00 0.00 NO

Based on the MVN test and the plot above, it is clear that the data are not normally distributed. Thus, an outlier detection/removal algorithm based on an MVN distribution would be inappropriate. This observation holds not only for Participant 9 but also for all the other participants. Consequently, in the following section we investigate method(s) from the signal processing literature.

P10

The output from the MVN package is shown below, where we depict the joint distribution of the three kinematic features for Participant 10. The goal here is two-fold. First, we examine whether the joint distribution of the three kinematic features can be modeled with a multivariate normal distribution. Second, if it can, we can use the Mahalanobis distance to identify outlying observations. In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 10 below.

Test E df p value MVN
1 Doornik-Hansen 9436.39 6.00 0.00 NO

Based on the MVN test and the plot above, it is clear that the data are not normally distributed. Thus, an outlier detection/removal algorithm based on an MVN distribution would be inappropriate. This observation holds not only for Participant 10 but also for all the other participants. Consequently, in the following section we investigate method(s) from the signal processing literature.

P11

The output from the MVN package is shown below, where we depict the joint distribution of the three kinematic features for Participant 11. The goal here is two-fold. First, we examine whether the joint distribution of the three kinematic features can be modeled with a multivariate normal distribution. Second, if it can, we can use the Mahalanobis distance to identify outlying observations. In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 11 below.

Test E df p value MVN
1 Doornik-Hansen 15641.71 6.00 0.00 NO

Based on the MVN test and the plot above, it is clear that the data are not normally distributed. Thus, an outlier detection/removal algorithm based on an MVN distribution would be inappropriate. This observation holds not only for Participant 11 but also for all the other participants. Consequently, in the following section we investigate method(s) from the signal processing literature.

P12

The output from the MVN package is shown below, where we depict the joint distribution of the three kinematic features for Participant 12. The goal here is two-fold. First, we examine whether the joint distribution of the three kinematic features can be modeled with a multivariate normal distribution. Second, if it can, we can use the Mahalanobis distance to identify outlying observations. In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 12 below.

Test E df p value MVN
1 Doornik-Hansen 7947.39 6.00 0.00 NO

Based on the MVN test and the plot above, it is clear that the data are not normally distributed. Thus, an outlier detection/removal algorithm based on an MVN distribution would be inappropriate. This observation holds not only for Participant 12 but also for all the other participants. Consequently, in the following section we investigate method(s) from the signal processing literature.

P13

The output from the MVN package is shown below, where we depict the joint distribution of the three kinematic features for Participant 13. The goal here is two-fold. First, we examine whether the joint distribution of the three kinematic features can be modeled with a multivariate normal distribution. Second, if it can, we can use the Mahalanobis distance to identify outlying observations. In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 13 below.

Test E df p value MVN
1 Doornik-Hansen 16868.94 6.00 0.00 NO

Based on the MVN test and the plot above, it is clear that the data are not normally distributed. Thus, an outlier detection/removal algorithm based on an MVN distribution would be inappropriate. This observation holds not only for Participant 13 but also for all the other participants. Consequently, in the following section we investigate method(s) from the signal processing literature.

P14

The output from the MVN package is shown below, where we depict the joint distribution of the three kinematic features for Participant 14. The goal here is two-fold. First, we examine whether the joint distribution of the three kinematic features can be modeled with a multivariate normal distribution. Second, if it can, we can use the Mahalanobis distance to identify outlying observations. In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 14 below.

Test E df p value MVN
1 Doornik-Hansen 3642.41 6.00 0.00 NO

Based on the MVN test and the plot above, it is clear that the data are not normally distributed. Thus, an outlier detection/removal algorithm based on an MVN distribution would be inappropriate. This observation holds not only for Participant 14 but also for all the other participants. Consequently, in the following section we investigate method(s) from the signal processing literature.

P15

The output from the MVN package is shown below, where we depict the joint distribution of the three kinematic features for Participant 15. The goal here is two-fold. First, we examine whether the joint distribution of the three kinematic features can be modeled with a multivariate normal distribution. Second, if it can, we can use the Mahalanobis distance to identify outlying observations. In addition to the plot, we provide the statistics of the Doornik-Hansen test for Participant 15 below.

Test E df p value MVN
1 Doornik-Hansen 7662.67 6.00 0.00 NO

Based on the MVN test and the plot above, it is clear that the data are not normally distributed. Thus, an outlier detection/removal algorithm based on an MVN distribution would be inappropriate. This observation holds not only for Participant 15 but also for all the other participants. Consequently, in the following section we investigate method(s) from the signal processing literature.
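Since the Doornik-Hansen test rejects multivariate normality for every participant, the Mahalanobis-distance route is closed for this data set. For readers who wish to reproduce this style of check, a minimal, self-contained sketch follows; the simulated `fake_features` data frame and the 0.975 chi-square cutoff are illustrative assumptions, not part of the study's data or code.

```r
library(pacman) # needs to be installed first
p_load(MVN)

# Illustrative (simulated) three-feature data frame; exponential draws are
# skewed, so the Doornik-Hansen test should reject multivariate normality
set.seed(2015)
fake_features <- data.frame(scaled.stride.len    = rexp(500, rate = 3),
                            scaled.stride.height = rexp(500, rate = 10),
                            stride.duration      = rexp(500, rate = 1))

dh_result <- mvn(fake_features, mvnTest = "dh") # Doornik-Hansen omnibus test
print(dh_result$multivariateNormality)

# Had normality NOT been rejected, outliers could have been flagged using
# the squared Mahalanobis distance against a chi-square cutoff:
d2 <- mahalanobis(fake_features, colMeans(fake_features), cov(fake_features))
flagged <- which(d2 > qchisq(0.975, df = ncol(fake_features)))
```

In recent versions of the MVN package, `mvn()` can also produce perspective/contour plots of the fitted bivariate densities via its `multivariatePlot` argument.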

3 Filtering Approaches

An alternative to detecting and then removing outliers is to use smoothing techniques to “correct” the data. Based on the results shown in the previous sections, the fundamental hypotheses behind this approach are:

  1. The sensors’ data appear to be non-stationary, autocorrelated, and somewhat complex (from a statistical perspective). More importantly, there are spikes in the data that cannot be explained by the kinematic model. These spikes produce unrealistic values (e.g., a scaled stride length > 1 is not possible, since it would mean that a person’s stride is longer than their height). While this insight is useful, there is no hard limit that we can enforce, since the literature reports mean values across the population rather than a physical threshold on what is possible.

  2. The outliers tended to appear in pairs, indicating a possible issue in the segmentation of the data. Note that we label these observations as outliers since the effect is not sustained (i.e., it is implausible that a person is fatigued for only two strides).

  3. Smoothing (i.e., applying a low-pass filter to the data) allows us to impute a value for every stride (after an initial window). This preserves the successive nature of the strides, “fixes” the outliers based on neighboring strides, and does not discard any observations.
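To make the third hypothesis concrete, the short sketch below uses base R’s `runmed()` (a running-median smoother; the analysis in the next subsection uses `medianFilter()` from the fractal package instead) on a hypothetical stride-length series containing a pair of spikes:

```r
# Hypothetical scaled stride lengths; the two successive values > 1 mimic
# the paired, physically impossible spikes described above
x <- c(0.35, 0.36, 0.34, 1.40, 1.38, 0.35, 0.36, 0.35)

# Running median with a window of 5 strides: the paired spikes are replaced
# by values consistent with the neighboring strides, and no observation is
# dropped (the output has the same length as the input)
x_smooth <- runmed(x, k = 5)
```

Because the window holds five strides, even two adjacent outliers can never form a majority within any window, which is why the pair is suppressed; a window of three would let a pair of spikes survive.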

3.1 The Median Filtering Approach from the Signal Processing Literature

An algorithmic approach based on these observations is implemented below. Note that we use the median filter (medianFilter) from the fractal package on CRAN.

# We are reading the data locally (you can download the file from GitHub)
load(file = "./Data/RGenerated/FeatureGeneration.RData")
subject_heights <- c(1.71, 1.77, 1.71, 1.59, 1.69,
                     1.63, 1.60, 1.71, 1.67, 1.78,
                     1.68, 1.55, 1.83, 1.81, 1.89)
for (i in 1:15) {
  df <- get(paste0("subject",i,"_features")) # features generated in Section 1
  # Median-filter each of the three kinematic features
  time.from.start <- df[,1]
  scaled.stride.len <- medianFilter(df[,2], order = 5) # moving window of size 5 obs
  scaled.stride.height <- medianFilter(df[,3], order = 5) # moving window of size 5 obs
  stride.duration <- medianFilter(df[,4], order = 5) # moving window of size 5 obs
 
  # Computing the first difference to make inspecting the values easier
  diff.time.from.start <- time.from.start[-1]
  diff.scaled.stride.len <- diff(scaled.stride.len)
  diff.scaled.stride.height <- diff(scaled.stride.height)
  diff.stride.duration <- diff(stride.duration)
  # Store the median-filtered data (and its first difference) per subject
  assign(paste0("subject",i,"_medianf"), 
         data.frame(cbind(time.from.start, scaled.stride.len, 
                          scaled.stride.height, stride.duration)))
  
  assign(paste0("subject",i,"_median_diff_f"), 
         data.frame(cbind( diff.time.from.start, diff.scaled.stride.len, 
                          diff.scaled.stride.height, diff.stride.duration)))
  
  # Preparing the data for the Line Graph
  df_transformed <- melt(get(paste0("subject",i,"_medianf")),
                         id.vars = "time.from.start",
                         measure.vars=c("scaled.stride.len",
                                        "scaled.stride.height",
                                        "stride.duration"
                                        )
                         ) # ggplot data needs to be tall
  
  diff_df_transformed  <- melt(get(paste0("subject",i,"_median_diff_f")),
                         id.vars = "diff.time.from.start",
                         measure.vars=c("diff.scaled.stride.len",
                                        "diff.scaled.stride.height",
                                        "diff.stride.duration"
                                        )
                         ) # ggplot data needs to be tall

  assign(paste0("g",i),
         ggplot(data = df_transformed,
                aes(x=time.from.start, y=value, group=variable,
                    color=variable,shape=variable)) + 
           geom_line() + theme_bw() + 
           ggtitle("Data post the application of the median filter") + 
           theme(legend.position="none",
                 #axis.text.x=element_text(angle=90,hjust=1),
                 plot.title = element_text(hjust = 0.5)) +
           facet_wrap(~variable,nrow=3, scales = "free_y")
         )
  
  assign(paste0("h",i),
         ggplot(data = diff_df_transformed,
                aes(x=diff.time.from.start, y=value, group=variable,
                    color=variable,shape=variable)) + 
           geom_line() + theme_bw() + 
           ggtitle("First Difference of the Signal post the median filter") + 
           theme(legend.position="none",
                 #axis.text.x=element_text(angle=90,hjust=1),
                 plot.title = element_text(hjust = 0.5)) +
           facet_wrap(~variable,nrow=3, scales = "free_y")
         )
  
  cat('###',paste0("P", i), "{-}",'\n')
  
  print(get(paste0("g",i)))
  
  cat('\n')
  
  cat(paste0('<p> Based on the <b>median filter</b>, Participant ', i,
             ' has an average scaled stride length of ',
             round(mean(scaled.stride.len),2),
             ', an average scaled stride height of ',
             round(mean(scaled.stride.height),2),
             ', and an average stride duration of ', round(mean(stride.duration),2),
             ' seconds. Given their height of ', subject_heights[i],
             ' meters, their average stride length and stride height are ',
             round(mean(scaled.stride.len)*subject_heights[i],2),
             ' and ', round(mean(scaled.stride.height)*subject_heights[i],2),
             ' meters, respectively.', '</p>'
             )
      )
  
  cat('\n')
  
  print(get(paste0("h",i)))
  
  cat(paste0('<p> Based on the <b>median filter</b>, the interquartile ranges of the <b>first differences</b> of Participant ', i,
             '\'s scaled stride length, scaled stride height, and stride duration are ',
             round(IQR(diff.scaled.stride.len),3),
             ', ', round(IQR(diff.scaled.stride.height),3),
             ', and ', round(IQR(diff.stride.duration),3),
             ', respectively. </p>'
             )
      )
    
  cat(' \n \n')
}

P1

Based on the median filter, Participant 1 has an average scaled stride length of 0.37, an average scaled stride height of 0.08, and an average stride duration of 1.3 seconds. Given their height of 1.71 meters, their average stride length and stride height are 0.63 and 0.14 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 1’s scaled stride length, scaled stride height, and stride duration are 0.014, 0.004, and 0, respectively.

P2

Based on the median filter, Participant 2 has an average scaled stride length of 0.3, an average scaled stride height of 0.1, and an average stride duration of 1.01 seconds. Given their height of 1.77 meters, their average stride length and stride height are 0.53 and 0.17 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 2’s scaled stride length, scaled stride height, and stride duration are 0.014, 0.004, and 0, respectively.

P3

Based on the median filter, Participant 3 has an average scaled stride length of 0.2, an average scaled stride height of 0.08, and an average stride duration of 1.24 seconds. Given their height of 1.71 meters, their average stride length and stride height are 0.34 and 0.13 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 3’s scaled stride length, scaled stride height, and stride duration are 0.018, 0.003, and 0, respectively.

P4

Based on the median filter, Participant 4 has an average scaled stride length of 0.21, an average scaled stride height of 0.09, and an average stride duration of 1.04 seconds. Given their height of 1.59 meters, their average stride length and stride height are 0.33 and 0.15 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 4’s scaled stride length, scaled stride height, and stride duration are 0.015, 0.005, and 0, respectively.

P5

Based on the median filter, Participant 5 has an average scaled stride length of 0.14, an average scaled stride height of 0.07, and an average stride duration of 1.1 seconds. Given their height of 1.69 meters, their average stride length and stride height are 0.23 and 0.12 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 5’s scaled stride length, scaled stride height, and stride duration are 0.01, 0.006, and 0, respectively.

P6

Based on the median filter, Participant 6 has an average scaled stride length of 0.09, an average scaled stride height of 0.04, and an average stride duration of 0.78 seconds. Given their height of 1.63 meters, their average stride length and stride height are 0.15 and 0.06 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 6’s scaled stride length, scaled stride height, and stride duration are 0.005, 0.002, and 0, respectively.

P7

Based on the median filter, Participant 7 has an average scaled stride length of 0.22, an average scaled stride height of 0.07, and an average stride duration of 1.17 seconds. Given their height of 1.6 meters, their average stride length and stride height are 0.35 and 0.12 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 7’s scaled stride length, scaled stride height, and stride duration are 0.019, 0.003, and 0, respectively.

P8

Based on the median filter, Participant 8 has an average scaled stride length of 0.36, an average scaled stride height of 0.08, and an average stride duration of 1.19 seconds. Given their height of 1.71 meters, their average stride length and stride height are 0.61 and 0.13 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 8’s scaled stride length, scaled stride height, and stride duration are 0.029, 0.005, and 0, respectively.

P9

Based on the median filter, Participant 9 has an average scaled stride length of 0.15, an average scaled stride height of 0.08, and an average stride duration of 1.17 seconds. Given their height of 1.67 meters, their average stride length and stride height are 0.25 and 0.13 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 9’s scaled stride length, scaled stride height, and stride duration are 0.012, 0.006, and 0, respectively.

P10

Based on the median filter, Participant 10 has an average scaled stride length of 0.25, an average scaled stride height of 0.09, and an average stride duration of 1.02 seconds. Given their height of 1.78 meters, their average stride length and stride height are 0.44 and 0.16 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 10’s scaled stride length, scaled stride height, and stride duration are 0.011, 0.004, and 0, respectively.

P11

Based on the median filter, Participant 11 has an average scaled stride length of 0.26, an average scaled stride height of 0.1, and an average stride duration of 1.1 seconds. Given their height of 1.68 meters, their average stride length and stride height are 0.44 and 0.16 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 11’s scaled stride length, scaled stride height, and stride duration are 0.013, 0.004, and 0, respectively.

P12

Based on the median filter, Participant 12 has an average scaled stride length of 0.31, an average scaled stride height of 0.1, and an average stride duration of 1.03 seconds. Given their height of 1.55 meters, their average stride length and stride height are 0.48 and 0.16 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 12’s scaled stride length, scaled stride height, and stride duration are 0.015, 0.006, and 0, respectively.

P13

Based on the median filter, Participant 13 has an average scaled stride length of 0.19, an average scaled stride height of 0.05, and an average stride duration of 1.21 seconds. Given their height of 1.83 meters, their average stride length and stride height are 0.34 and 0.09 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 13’s scaled stride length, scaled stride height, and stride duration are 0.011, 0.002, and 0, respectively.

P14

Based on the median filter, Participant 14 has an average scaled stride length of 0.45, an average scaled stride height of 0.1, and an average stride duration of 1.4 seconds. Given their height of 1.81 meters, their average stride length and stride height are 0.82 and 0.17 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 14’s scaled stride length, scaled stride height, and stride duration are 0.027, 0.008, and 0, respectively.

P15

Based on the median filter, Participant 15 has an average scaled stride length of 0.33, an average scaled stride height of 0.09, and an average stride duration of 1.09 seconds. Given their height of 1.89 meters, their average stride length and stride height are 0.62 and 0.17 meters, respectively.

Based on the median filter, the interquartile ranges of the first differences of Participant 15’s scaled stride length, scaled stride height, and stride duration are 0.014, 0.003, and 0, respectively.

# Saving the median-filtered data frames (and their first differences) for reuse
save(subject1_medianf, subject2_medianf, subject3_medianf, subject4_medianf,
     subject5_medianf, subject6_medianf, subject7_medianf, subject8_medianf,
     subject9_medianf, subject10_medianf, subject11_medianf, subject12_medianf,
     subject13_medianf, subject14_medianf, subject15_medianf,
     subject1_median_diff_f, subject2_median_diff_f, subject3_median_diff_f,
     subject4_median_diff_f,
     subject5_median_diff_f, subject6_median_diff_f, subject7_median_diff_f,
     subject8_median_diff_f,
     subject9_median_diff_f, subject10_median_diff_f, subject11_median_diff_f,
     subject12_median_diff_f,
     subject13_median_diff_f, subject14_median_diff_f, subject15_median_diff_f,
     file = "./Data/RGenerated/MedianFilteredData.RData")

  1. Department of Mechanical and Aerospace Engineering, University at Buffalo

  2. Department of Industrial and Systems Engineering, University at Buffalo

  3. Farmer School of Business, Miami University

  4. Farmer School of Business, Miami University. This author can be reached by email at fmegahed@miamioh.edu.